About the website

This website serves as a course diary for the University of Helsinki course Introduction to Open Data Science (IODS).

Code repository

The R code used is stored and available for inspection in this GitHub repository.


Linear regression exercise

Data wrangling

The script for data wrangling is available here.

Load the data

#Hidden
source("helper_functions.R")
library(knitr)
library(kableExtra)
library(ggplot2)
library(GGally)
library(tidyr)
library(dplyr, warn.conflicts = F)
library(stringr)
analysis <- read.csv("data/learning2014.csv") %>% tbl_df()
glimpse(analysis)
## Observations: 166
## Variables: 7
## $ gender   <fctr> F, M, F, M, M, F, M, F, M, F, M, F, F, F, M, F, F, F...
## $ Age      <int> 53, 55, 49, 53, 49, 38, 50, 37, 37, 42, 37, 34, 34, 3...
## $ Attitude <int> 37, 31, 25, 35, 37, 38, 35, 29, 38, 21, 39, 38, 24, 3...
## $ deep     <dbl> 3.636364, 3.000000, 3.045455, 3.500000, 3.681818, 4.3...
## $ stra     <dbl> 3.375, 2.750, 3.625, 3.125, 3.625, 3.625, 2.250, 4.00...
## $ surf     <dbl> 2.583333, 3.166667, 2.250000, 2.250000, 2.833333, 2.4...
## $ Points   <int> 25, 12, 24, 10, 22, 21, 21, 31, 24, 26, 31, 31, 23, 2...

The data contains the learning results (points) of 166 students and possibly associated variables: attitude, age, and gender. Also included are the students’ Likert scores on the following dimensions of learning: deep, surface/superficial, and strategic.

Complete description of study variables

Variables starting with ST, SU, and D followed by a number represent the strategic, superficial, and deep dimensions, respectively.

#Hidden
info <- read.delim("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-meta.txt", 
           header = F, sep = "\t", encoding = "latin1")[12:71,] %>% 
  str_split(boundary(type = "word"), n = 2, simplify = T) %>% as.data.frame()
names(info) <- c("Variable", "Description") 
info %>%
  kable("html", align = "rrr", caption = "Data variable info") %>% 
  kable_styling(bootstrap_options = c("hover", "condensed")) %>%
  scroll_box(height = "200px")
Data variable info
Variable Description
Aa Making sure you remember things well
Ab Developing as a person
Ac Building up knowledge by acquiring facts and information
Ad Being able to use the information you’ve acquired
Ae Understanding new material for yourself
Af Seeing things in a different and more meaningful way
ST01 I manage to find conditions for studying which allow me to get on with my work easily.
SU02 Often I find myself wondering whether the work I’m doing here is really worthwhile.
D03 I usually set out to understand for myself the meaning of what we have to learn.
ST04 I organize my study time carefully to make the best use of it.
SU05 I find I have to concentrate on just memorising a good deal of what I have to learn.
D06 I look at the evidence carefully and try to reach my own conclusion about what I’m studying.
D07 I try to relate ideas I come across to those in other topics or other courses whenever possible.
SU08 I tend to read very little beyond what is actually required to pass.
ST09 I think I’m quite systematic and organised when it comes to revising for exams.
SU10 There’s not much of the work here that I find interesting or relevant.
D11 When I’m reading an article or book, I try to find out for myself exactly what the author means.
ST12 I’m pretty good at getting down to work whenever I need to.
SU13 Much of what I’m studying makes little sense: it’s like unrelated bits and pieces.
D14 When I’m working on a new topic, I try to see in my own mind how all the ideas fit together.
D15 Often I find myself questioning things I hear in lectures or read in books.
SU16 I concentrate on learning just those bits of information I have to know to pass.
ST17 I’m good at following up some of the reading suggested by lecturers or tutors.
SU18 When I look back, I sometimes wonder why I ever decided to come here.
D19 When I’m reading I stop from time to time to reflect on what I am trying to learn from it.
ST20 I work steadily through the term or semester, rather than leave it all until the last minute.
SU21 I’m not really sure what’s important in lectures, so I try to get down all I can.
D22 Ideas in course books or articles often set me off on long chains of thought of my own.
D23 When I read, I examine the details carefully to see how they fit in with what’s being said.
SU24 I gear my studying closely to just what seems to be required for assignments and exams.
ST25 I usually plan out my week’s work in advance, either on paper or in my head.
SU26 I’m not really interested in this course, but I have to take it for other reasons.
D27 Before tackling a problem or assignment, I first try to work out what lies behind it.
ST28 I generally make good use of my time during the day.
SU29 I often have trouble in making sense of the things I have to remember.
D30 I like to play around with ideas of my own even if they don’t get me very far.
D31 It’s important for me to be able to follow the argument, or to see the reason behind things.
SU32 I like to be told precisely what to do in essays or other assignments.
Ca Lecturers who tell us exactly what to put down in our notes.
Cb Lecturers who encourage us to think for ourselves and show us how they themselves think.
Cc Exams which allow me to show that I’ve thought about the course material for myself.
Cd Exams or tests which need only the material provided in our lecture notes.
Ce Courses in which it’s made very clear just which books we have to read.
Cf Courses where we’re encouraged to read around the subject a lot for ourselves.
Cg Books which challenge you and provide explanations which go beyond the lectures.
Ch Books which give you definite facts and information which can easily be learned.
Da I feel confident I can master statistics topics
Db Statistics will be useful for my future work
Dc I am interested in understanding statistics
Dd I did well in high school mathematics courses
De I am interested in learning statistics
Df I feel insecure when I have to do statistics problems
Dg I like statistics
Dh I am scared by statistics
Di I feel confident in doing math operations
Dj Statistics will be useful for my future studies
Age Age (in years) derived from the date of birth
Attitude Global attitude toward statistics
Points Exam points
gender Gender: M (Male), F (Female)

Exploratory analysis of study variables

#Hidden
ggpairs(analysis,
  title = "Study variable overview",
  upper = list(continuous = wrap("cor", size = 3)), 
  lower = list(continuous = wrap("points", alpha = .2, size = .6),
               combo = wrap("facethist", bins = 10))) +
  theme(
    axis.text.x = element_text(angle = 90, color = "black", size = 7, vjust = .5),
    axis.text.y = element_text(color = "black", size = 7))

#Hidden
summaryKable(analysis) %>% 
  kable("html", align = "rrr", caption = "Data variable summary") %>% 
  kable_styling(bootstrap_options = c("hover", "condensed")) %>% 
  scroll_box(width = "100%")
Data variable summary

         gender      Age  Attitude   deep   stra   surf  Points
Min                17.000    14.000  1.818  1.250  1.583   7.000
1st Q    F: 110    21.000    26.000  3.091  2.625  2.417  19.000
Median   M: 56     22.000    32.000  3.455  3.188  2.833  23.000
Mean               25.512    31.428  3.436  3.121  2.787  22.717
3rd Q              27.000    37.000  3.727  3.625  3.167  27.750
Max                55.000    50.000  4.455  5.000  4.333  33.000

All variables seem to be normally distributed. There are considerably more female subjects than male (110 vs. 56). Eyeballing the histograms above, there seem to be no great differences between genders, except perhaps for the attitude variable. Points, our dependent variable, has a bit of a notch on the left tail but still follows normality reasonably well. For the age variable, a logarithmic transformation might improve normality.
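A quick way to test that log-transform intuition is to run the Shapiro–Wilk test before and after the transformation. A minimal sketch on simulated right-skewed data (the values are illustrative, not the course data):

```r
# Simulate right-skewed "ages" and compare normality before/after log
set.seed(1)
age <- rlnorm(166, meanlog = 3.2, sdlog = 0.5)
shapiro.test(age)$p.value        # small p-value: clearly non-normal
shapiro.test(log(age))$p.value   # much larger: log brings it close to normal
```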

Generating linear model for predicting points

Model #1

I chose to use attitude, deep, and stra as explanatory variables for points, as they were the most strongly correlated with it according to the exploratory analysis.

model <- lm(Points ~ Attitude + deep + stra, data = analysis)
summary(model)
## 
## Call:
## lm(formula = Points ~ Attitude + deep + stra, data = analysis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.39145    3.40775   3.343  0.00103 ** 
## Attitude     0.41497    0.08882   4.672 6.24e-06 ***
## deep        -1.37353    1.37621  -0.998  0.31974    
## stra         0.96208    0.53668   1.793  0.07489 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

In this model only attitude significantly predicts points, so the other two variables are excluded. The high p-value for deep learning probably arises from the fact that Attitude and deep learning are highly correlated (r = 0.8).

Model #2

model <- lm(Points ~ Attitude, data = analysis)
summary(model)
## 
## Call:
## lm(formula = Points ~ Attitude, data = analysis)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

So a student’s attitude seems to correlate with how many points they will score in the test; a better attitude predicts better scores. The R-squared value describes what proportion of the variance of the dependent variable (points) is explained by the model. In this case, only 18.5% of the variability in points is explained by the proposed model.
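The R-squared value can also be computed by hand from the residuals, as 1 − SS_res / SS_tot; a minimal sketch on simulated toy data (all names are illustrative):

```r
# R-squared "by hand" matches the value lm() reports
set.seed(42)
toy <- data.frame(x = rnorm(100))
toy$y <- 2 * toy$x + rnorm(100)
fit <- lm(y ~ x, data = toy)
ss_res <- sum(residuals(fit)^2)                 # unexplained variation
ss_tot <- sum((toy$y - mean(toy$y))^2)          # total variation
r2_manual <- 1 - ss_res / ss_tot
all.equal(r2_manual, summary(fit)$r.squared)    # TRUE
```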

Diagnostic plots

par(mfrow = c(2,2), oma = c(0, 0, 2, 0), mar = c(2.5,3,2,0.5), mgp = c(1.5,.5,0))
plot(model, which = c(1,2), add.smooth = T)
autoimage::reset.par()

norm.res <- model$residuals/(sqrt(deviance(model)/df.residual(model))*sqrt(1-hatvalues(model)))
# Counted the normalized residuals long way for fun. Following code can be used to check results
# sum(norm.res != rstandard(model))

aa <- analysis$Attitude
leverage <- (aa-mean(aa))^2/sum((aa-mean(aa))^2)+1/length(aa)

plot(leverage, norm.res, xlab = "Leverage", ylab = "Standardized residuals")

plot(cooks.distance(model), norm.res, xlab = "Cook's distance", ylab = "Standardized residuals")

Conclusions

According to the diagnostic plots, this model has no critical errors:
1. Residuals are the differences between the fitted and the actual values. In this plot we see no clustering or other patterns that could indicate problems in the model; the variance of the errors is constant.
2. On the Q–Q plot, the model loses some accuracy at the extreme values. In other words, the model overestimates the performance of students with either a very positive or a very negative attitude. The Shapiro–Wilk test does not agree with normality, however: shapiro.test(rstandard(model))$p.value gives p = 0.0032512. Still, the errors in this model are distributed normally enough.
3. Leverage describes the unusualness of the predictor values. For any individual variable, it tells how extreme an observation is, i.e., how far it lies from the variable mean. For a multivariate model, these “hatvalues” take “combined unusualness” across variables into consideration: an observation might not be far from the mean in any single variable, but the combination of its values might be. In this model, no observations have high leverage.
4. Cook’s distance tells how much the fitted values would change if the observation in question were deleted; it can be used to diagnose and remove influential outliers. No single observation affects the model too much.
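Point 4 can be illustrated on toy data: plant one influential observation and flag it with the common 4/n rule of thumb (the data and the threshold choice here are illustrative, not part of the exercise):

```r
# Cook's distance in action: one extreme point dominates the fit
set.seed(7)
toy <- data.frame(x = rnorm(50))
toy$y <- toy$x + rnorm(50)
toy$y[50] <- 15                               # plant an influential outlier
fit <- lm(y ~ x, data = toy)
which(cooks.distance(fit) > 4 / nrow(toy))    # the planted point is flagged
```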


Logistic regression exercise

Data wrangling

The code used for the data wrangling exercise can be found here.

Load data

#Hidden
library(knitr)
library(kableExtra)
library(ggplot2)
library(GGally)
library(tidyr)
library(stringr)
library(dplyr, warn.conflicts = F)
source("helper_functions.R")
library(pROC)
df <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt", sep=",", head = T) %>% tbl_df()
#Hidden
summaryKable(df) %>% 
  kable("html", align = "rrr", caption = "Data variable summary") %>%
  kable_styling(bootstrap_options = c("hover", "condensed")) %>% 
  scroll_box(width = "100%")
Data variable summary
(Wide summary table over all 35 variables: school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, nursery, internet, guardian, traveltime, studytime, failures, schoolsup, famsup, paid, activities, higher, romantic, famrel, freetime, goout, Dalc, Walc, health, absences, G1, G2, G3, alc_use, high_use. Key figures: 198 female and 184 male students, 342 in school GP and 40 in MS, mean alc_use 1.88, and 29.3 % of students classified as high users.)

So, the data explores the school performance of Portuguese students. The set contains information on 382 students: their grades and several variables that might correlate with performance.

Hypothesis of high alcohol use predicting variables

I hypothesized that good family relations and extra-curricular activities would protect from high alcohol usage, while a higher number of absences and going out more with friends would, conversely, be associated with increased alcohol usage.

mod <- df %>% select(high_use, famrel, activities, absences, goout) 
mod
## # A tibble: 382 x 5
##    high_use famrel activities absences goout
##       <lgl>  <int>     <fctr>    <int> <int>
##  1    FALSE      4         no        6     4
##  2    FALSE      5         no        4     3
##  3     TRUE      4         no       10     2
##  4    FALSE      3        yes        2     2
##  5    FALSE      4         no        4     2
##  6    FALSE      5        yes       10     2
##  7    FALSE      4         no        0     4
##  8    FALSE      4         no        6     4
##  9    FALSE      4         no        0     2
## 10    FALSE      5        yes        0     1
## # ... with 372 more rows
mod %>% select(famrel:goout) %>% 
  sapply(., function(variable_value) table(mod$high_use, variable_value))
## $famrel
##        variable_value
##           1   2   3   4   5
##   FALSE   7   9  40 131  83
##   TRUE    2   9  26  52  23
## 
## $activities
##        variable_value
##          no yes
##   FALSE 122 148
##   TRUE   59  53
## 
## $absences
##        variable_value
##          0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
##   FALSE 94  2 54  2 36  4 19  6 13  3 12  1  6  0  7  1  2  0  2  0  1  1
##   TRUE  22  1 13  4 15  0 11  2  8  0  5  1  4  3  4  2  4  1  2  1  1  0
##        variable_value
##         22 23 24 25 26 28 30 54 56 75
##   FALSE  0  1  0  1  1  0  0  0  0  1
##   TRUE   3  0  1  0  0  1  1  1  1  0
## 
## $goout
##        variable_value
##           1   2   3   4   5
##   FALSE  21  84 100  43  22
##   TRUE    3  15  23  39  32
#Hidden
p1 <-  ggplot(mod, aes(high_use, famrel)) + geom_boxplot() 
p2 <-  ggplot(mod, aes(high_use, fill = activities)) + geom_bar()
p3 <-  ggplot(mod, aes(high_use, absences)) + geom_boxplot() 
p4 <-  ggplot(mod, aes(high_use, goout)) + geom_boxplot() 

multiplot(p1,p2,p3,p4, cols = 2)

These values are well in line with the proposed hypothesis, except for activities: it seems that extra-curricular activities do not protect from high alcohol usage.

Logistic regression

Model

m <- glm(high_use ~ famrel + activities + absences + goout, data = df, family = binomial)
summary(m) 
## 
## Call:
## glm(formula = high_use ~ famrel + activities + absences + goout, 
##     family = binomial, data = df)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9971  -0.7713  -0.5066   0.8710   2.4369  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.16563    0.63289  -3.422 0.000622 ***
## famrel        -0.39022    0.13579  -2.874 0.004059 ** 
## activitiesyes -0.42269    0.25302  -1.671 0.094804 .  
## absences       0.05834    0.01630   3.579 0.000345 ***
## goout          0.81149    0.12286   6.605 3.98e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 385.49  on 377  degrees of freedom
## AIC: 395.49
## 
## Number of Fisher Scoring iterations: 4

Based on the logistic regression model shown above, the hypothesis seems to hold for all proposed factors, although the effect of extra-curricular activities is only marginally significant (p ≈ 0.09).

Details of model

coef(m)
##   (Intercept)        famrel activitiesyes      absences         goout 
##   -2.16563017   -0.39021678   -0.42268545    0.05833905    0.81149225
OR <- coef(m) %>% exp()
CI <- suppressMessages(confint(m)) %>% exp() 
cbind(OR, CI) %>% round(3)
##                  OR 2.5 % 97.5 %
## (Intercept)   0.115 0.032  0.387
## famrel        0.677 0.517  0.882
## activitiesyes 0.655 0.397  1.073
## absences      1.060 1.029  1.098
## goout         2.251 1.782  2.887

A one-step improvement in family relationships is associated with an odds ratio of 0.68, meaning that the risk for high alcohol usage is reduced. A growing number of absences, going out more, and a lack of extra-curricular activities are, conversely, associated with higher odds of high alcohol usage.
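The odds-ratio mechanics can be checked on simulated data where the true effect is known: exponentiating a logistic regression coefficient gives the multiplicative change in odds per unit of the predictor. A sketch with illustrative values (true slope 0.7, so the OR should land near exp(0.7) ≈ 2):

```r
# Recover a known odds ratio from simulated binary data
set.seed(3)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.7 * x))   # true log-odds slope 0.7
m_toy <- glm(y ~ x, family = binomial)
exp(coef(m_toy))["x"]                      # odds ratio per unit of x
```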

Predictions

df <- mutate(df, prob = predict(m, type = "response"))
plot.roc(df$high_use, df$prob)

df <- mutate(df, pred = prob > .4)
table(high_use = df$high_use, prediction = df$pred)
##         prediction
## high_use FALSE TRUE
##    FALSE   229   41
##    TRUE     47   65
table(high_use = df$high_use, prediction = df$pred) %>%
  prop.table() %>% `*`(100) %>% round(2) %>% addmargins()
##         prediction
## high_use  FALSE   TRUE    Sum
##    FALSE  59.95  10.73  70.68
##    TRUE   12.30  17.02  29.32
##    Sum    72.25  27.75 100.00

Based on the receiver operating characteristic (ROC) curve, the classification threshold should be a little over .3 for optimal accuracy. With a threshold of .4, roughly 77% of predictions are correct. This is better than what you would expect from a simple coin-tossing scenario, so there is some benefit in using our model.
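The threshold scan behind that statement can be sketched as follows; probs and labels are simulated stand-ins for df$prob and df$high_use, so the optimal threshold here is illustrative only:

```r
# Accuracy as a function of the classification threshold
accuracy_at <- function(th, probs, labels) mean((probs > th) == labels)
set.seed(5)
labels <- rbinom(200, 1, 0.3) == 1
probs  <- ifelse(labels, rbeta(200, 4, 2), rbeta(200, 2, 4))
ths  <- seq(0.1, 0.9, by = 0.05)
accs <- sapply(ths, accuracy_at, probs = probs, labels = labels)
ths[which.max(accs)]   # threshold giving the best accuracy on this data
```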

10-fold CV

library(boot)
loss_func <- function(class, prob) {
  # Adjusted for threshold .4
  n_wrong <- abs(class - prob - .1) > .5
  mean(n_wrong)
}
loss_func(df$high_use, df$prob)
## [1] 0.2303665
# Easier func for counting wrong
1-(mean(df$high_use == df$pred))
## [1] 0.2303665
cv <- cv.glm(data = df, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2565445

So with 10-fold cross-validation, the estimated accuracy decreases from ~77% (on the training data) to ~74%.
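For reference, the 10-fold cross-validation that cv.glm performs can be written out by hand; a sketch on simulated toy data (the fold assignment and loss mirror the idea, not cv.glm's exact implementation):

```r
# Manual 10-fold CV: fit on 9 folds, measure misclassification on the held-out fold
set.seed(11)
n <- 300
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x))
folds <- sample(rep(1:10, length.out = n))    # random fold assignment
errs <- sapply(1:10, function(k) {
  fit  <- glm(y ~ x, family = binomial, data = d[folds != k, ])
  prob <- predict(fit, newdata = d[folds == k, ], type = "response")
  mean((prob > 0.5) != d$y[folds == k])       # error rate in fold k
})
mean(errs)                                    # cross-validated error estimate
```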


Clustering and classification

2 Loading data

#Hidden
source("helper_functions.R")
library(magrittr)
library(GGally)
library(ggplot2)
library(stringr)
library(tidyr)
library(MASS)
library(knitr)
library(kableExtra)
library(corrplot)
library(plotly)
library(dplyr)
data("Boston")
df <- Boston %>% tbl_df() %>% 
  mutate_at(vars(chas), funs(as.factor))
glimpse(df)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <fctr> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

The dataset has 14 variables; a complete description can be found here. In short, it contains information on housing values in different suburbs of Boston, with multiple variables describing, for example, safety, housing density, and accessibility.

3 Exploring data

Start

#Hidden
summaryKable(df) %>% 
  kable("html", align = "rrr", caption = "Data variable summary") %>%
  kable_styling(bootstrap_options = c("hover", "condensed")) %>% 
  scroll_box(width = "100%")
Data variable summary

          crim       zn   indus    chas    nox     rm      age     dis     rad      tax  ptratio    black   lstat    medv
Min      0.006    0.000   0.460          0.385  3.561    2.900   1.130   1.000  187.000   12.600    0.320   1.730   5.000
1st Q    0.082    0.000   5.190  0: 471  0.449  5.886   45.025   2.100   4.000  279.000   17.400  375.377   6.950  17.025
Median   0.257    0.000   9.690  1: 35   0.538  6.208   77.500   3.207   5.000  330.000   19.050  391.440  11.360  21.200
Mean     3.614   11.364  11.137          0.555  6.285   68.575   3.795   9.549  408.237   18.456  356.674  12.653  22.533
3rd Q    3.677   12.500  18.100          0.624  6.623   94.075   5.188  24.000  666.000   20.200  396.225  16.955  25.000
Max     88.976  100.000  27.740          0.871  8.780  100.000  12.127  24.000  711.000   22.000  396.900  37.970  50.000
#Hidden
ggpairs(df,
  title = "Study variable overview",
  upper = list(continuous = wrap("cor", size = 2)),
  lower = list(
    continuous = wrap("points", alpha = .2, size = .3),
    combo = wrap("facethist", bins = 20))) +
theme(axis.text.x = element_text(
                  angle = 90,
                  color = "black",
                  size = 6,
                  vjust = .5),
      axis.text.y = element_text(color = "black", size = 6))

The variables are on very different scales and several are strongly skewed, which motivates the scaling below.

4 Scaling and sampling

Scaling

# Scale function adds attributes, and ggpairs doesn't like it
df %<>% mutate_at(vars(-chas), funs(scale)) %>%
  mutate_at(vars(-chas), funs(as.vector)) 

df$crim <- ntile(df$crim, 4) %>% 
  factor(labels = c("low", "med-lo", "med-hi", "high"))
table(df$crim)
## 
##    low med-lo med-hi   high 
##    127    126    127    126
#Hidden
summaryKable(df) %>% 
  kable("html", align = "rrr", caption = "Data variable summary") %>%
  kable_styling(bootstrap_options = c("hover", "condensed")) %>% 
  scroll_box(width = "100%")
Data variable summary

               crim      zn   indus    chas     nox      rm     age     dis     rad     tax  ptratio   black   lstat    medv
Min                  -0.487  -1.556          -1.464  -3.876  -2.333  -1.266  -0.982  -1.313   -2.705  -3.903  -1.530  -1.906
1st Q     low: 127   -0.487  -0.867  0: 471  -0.912  -0.568  -0.837  -0.805  -0.637  -0.767   -0.488   0.205  -0.799  -0.599
Median  med-lo: 126  -0.487  -0.211  1: 35   -0.144  -0.108   0.317  -0.279  -0.522  -0.464    0.275   0.381  -0.181  -0.145
Mean    med-hi: 127   0.000   0.000           0.000   0.000   0.000   0.000   0.000   0.000    0.000   0.000   0.000   0.000
3rd Q    high: 126    0.049   1.015           0.598   0.482   0.906   0.662   1.660   1.529    0.806   0.433   0.602   0.268
Max                   3.800   2.420           2.730   3.552   1.116   3.957   1.660   1.796    1.637   0.441   3.545   2.987
#Hidden
ggpairs(df,
  title = "Study variable overview",
  upper = list(continuous = wrap("cor", size = 2)),
  lower = list(
    continuous = wrap("points", alpha = .2, size = .3),
    combo = wrap("facethist", bins = 20))) +
theme(axis.text.x = element_text(
                  angle = 90,
                  color = "black",
                  size = 6,
                  vjust = .5),
      axis.text.y = element_text(color = "black", size = 6))


Sampling to test and train sets

sample_ind <- sample(nrow(df),  size = nrow(df) * 0.8)
train <- df[sample_ind,]
test <-  df[-sample_ind,]
data.frame(dim(train),dim(test))
##   dim.train. dim.test.
## 1        404       102
## 2         14        14


5 Linear discriminant analysis

Model and plot

lda.fit <- lda(crim ~ ., data = train)
lda.fit
## Call:
## lda(crim ~ ., data = train)
## 
## Prior probabilities of groups:
##       low    med-lo    med-hi      high 
## 0.2524752 0.2400990 0.2549505 0.2524752 
## 
## Group means:
##                 zn      indus      chas1        nox          rm        age
## low     0.96869081 -0.9275539 0.02941176 -0.8800690  0.43995584 -0.8752278
## med-lo -0.09029515 -0.2586918 0.06185567 -0.5591144 -0.14755635 -0.3011476
## med-hi -0.40315105  0.1828882 0.12621359  0.3681828  0.02757593  0.3882566
## high   -0.48724019  1.0171096 0.05882353  1.0742484 -0.40431061  0.8165134
##               dis        rad        tax     ptratio      black
## low     0.9152683 -0.7093918 -0.7269121 -0.43457339  0.3767300
## med-lo  0.3378759 -0.5520841 -0.4459848 -0.02565127  0.3065652
## med-hi -0.3745986 -0.3853374 -0.2916834 -0.21534639  0.1212889
## high   -0.8653019  1.6382099  1.5141140  0.78087177 -0.7919814
##               lstat        medv
## low    -0.757964495  0.51545824
## med-lo -0.106177195 -0.01937209
## med-hi -0.008382417  0.16554488
## high    0.912394629 -0.72576793
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.10596238  0.716313800 -0.82222685
## indus   -0.01050514 -0.243641309  0.39393152
## chas1   -0.05493639 -0.258225563  0.33850607
## nox      0.38995586 -0.679864517 -1.52696583
## rm       0.06508271  0.047111185 -0.15132783
## age      0.22825740 -0.269001654 -0.11292503
## dis     -0.11516205 -0.190164336  0.07435738
## rad      2.92998164  0.842304267 -0.13089022
## tax     -0.00351874  0.079643460  0.62779837
## ptratio  0.12484708 -0.023702396 -0.27412428
## black   -0.11230547 -0.007179224  0.08311835
## lstat    0.25428690 -0.211257208  0.33546281
## medv     0.06150407 -0.467287727 -0.26362829
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9461 0.0399 0.0140

I made it 3D for fun

#Hidden
# plot the lda results
points <- data.frame(crim = train$crim,
                     lda = predict(lda.fit)$x)
levels(points$crim) %<>%  str_to_title()

arrows <- coef(lda.fit) %>% 
  data.frame(., label = rownames(.)) %>% arrange(desc(abs(LD1))) %>% 
  mutate(LD1 = LD1*2.5, LD2 = LD2*2.5, LD3 = LD3*2.5, pos = 1) %>% 
  rbind(., mutate(., LD1=0, LD2=0, LD3=0, pos =0)) 


p1 <- plot_ly(arrows, x = ~LD1, y = ~LD2, z = ~LD3, 
  type = "scatter3d" , color = ~label, colors = rep(rgb(0, 0, 0), 13),
  opacity = .5, mode = "lines", hoverinfo = "name", showlegend = FALSE, 
  line = list(width = 5))

p2 <- plot_ly(points, x = ~lda.LD1, y = ~lda.LD2, z = ~lda.LD3, 
    type = "scatter3d" , color = ~crim, opacity = .5, hoverinfo = "none",
    mode = "markers", marker = list(size = 3, width = 2)) %>% 
  layout(title = "LDA",
       scene = list(xaxis = list(title = "LDA1"),
                    yaxis = list(title = "LDA2"),
                    zaxis = list(title = "LDA3")))

subplot(p1, p2)

6 Testing the model

table("Crime" = test$crim, 
      "Prediction" = predict(lda.fit, newdata = test)$class)
##         Prediction
## Crime    low med-lo med-hi high
##   low     14      9      2    0
##   med-lo   7     17      5    0
##   med-hi   1      5     17    1
##   high     0      0      0   24

The cross-tabulation shows that the model classifies about 71% (72/102) of the test observations correctly. The high-crime class is identified perfectly (24/24), while most of the confusion occurs between the adjacent low and medium categories.

7 Kmeans

Finding optimal number

data("Boston")
df <- Boston %>% tbl_df
df %<>% mutate_at(vars(-chas), funs(scale)) %>% 
  mutate_at(vars(-chas), funs(as.vector)) 
  
cbind(Bost = summary(dist(Boston)), df = summary(dist(df))) %>% t()
##        Min. 1st Qu.  Median    Mean 3rd Qu.   Max.
## Bost 1.1190  85.620 170.500 226.300 371.900 626.00
## df   0.1343   3.266   4.612   4.727   5.957  13.88
km <- kmeans(df, centers = 3)

twcss <- sapply(1:20, function(k) kmeans(df, k)$tot.withinss)
qplot(x = 1:20, y = twcss, geom = 'line')

Based on the elbow in the total within-cluster sum of squares above, the optimal number of clusters is around 3: the curve drops steeply up to that point and flattens afterwards.

K-means with the optimal number of clusters

km <- kmeans(df, centers = 3)
#Hidden
df %>% mutate(clust = factor(km$cluster)) %>%
  gather(., key = "var", value = "value", -c(crim, clust)) %>%
  ggplot(aes(value, crim, color = clust)) + 
  geom_point(shape = 1, size = 1.5, stroke = 1, alpha = .3) + 
  facet_wrap(~var, scales = "free_x") 

Tax, among other variables, seems to separate the clusters; the linear discriminant analysis below examines this more formally.

Bonus

# Add clusters of 3-centered kmeans to normalized boston dataset and remove variable crim
df %<>% mutate(clust = factor(km$cluster)) %>% select(-crim)
lda.clust <- lda(clust ~ ., data = df)
lda.clust
## Call:
## lda(clust ~ ., data = df)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.4664032 0.3241107 0.2094862 
## 
## Group means:
##           zn     indus       chas        nox          rm         age
## 1 -0.3417123 -0.296848 0.07203390 -0.3345884 -0.09228038 -0.02966623
## 2 -0.4872402  1.117990 0.07317073  1.1253988 -0.46443119  0.79737580
## 3  1.5146367 -1.068814 0.05660377 -0.9962503  0.92400834 -1.16762641
##           dis        rad        tax     ptratio      black      lstat
## 1  0.05695857 -0.5803944 -0.6030198 -0.08691245  0.2863040 -0.1801190
## 2 -0.85425848  1.2219249  1.2954050  0.60580719 -0.6407268  0.8719904
## 3  1.19486951 -0.5983266 -0.6616391 -0.74378342  0.3538816 -0.9480974
##          medv
## 1  0.03577844
## 2 -0.68418954
## 3  0.97889973
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## zn      -0.06485555  1.23950502
## indus    0.61254453  0.09577712
## chas     0.07926869 -0.15367696
## nox      1.00795668  0.68195436
## rm      -0.16292603  0.44848875
## age     -0.07366553 -0.59605109
## dis     -0.03419212  0.41832720
## rad      0.70132328  0.11664032
## tax      0.98650674  0.69358984
## ptratio  0.22769090  0.14455413
## black   -0.01452242 -0.04363522
## lstat    0.18056486  0.51577363
## medv    -0.02090320  0.60976280
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8425 0.1575
#Hidden
points <- data.frame(clust = df$clust,
                     lda = predict(lda.clust)$x)
names(points) <- c("Cluster", "LD1", "LD2")

arrows <- coef(lda.clust) %>% 
  data.frame(., label = str_to_upper(rownames(.))) %>% 
  arrange(desc(abs(LD1))) %>% 
  # Scale the arrows 
  mutate(LD1 = LD1*5, LD2 = LD2*5, pos = 1) 

ggplot(points) +
  theme_minimal() +
  geom_point(aes(LD1, LD2, color = Cluster), 
             shape = 1, size = 2, stroke = 1.5, alpha = .75) +
  geom_segment(data = arrows, 
               aes(y=0, x=0, yend=LD2, xend=LD1, alpha = .5), 
               arrow = arrow(length = unit(0.075, "inches")),
               show.legend = F) +
  # Adjust the labels .2 units away from the arrowhead
  geom_text(data = arrows, aes(x = LD1+.2*(LD1/sqrt(LD1^2+LD2^2)), 
                               y = LD2+.2*(LD2/sqrt(LD1^2+LD2^2)), 
                               hjust = .5,
                               label = label), 
            show.legend = F) 

The strongest predictors of clusters are tax, nox, zn, and age.

Superbonus

model_predictors <- dplyr::select(train, -crim) %>% 
  mutate(chas = as.numeric(chas))

# check the dimensions
data.frame(dim(model_predictors), dim(lda.fit$scaling))
##   dim.model_predictors. dim.lda.fit.scaling.
## 1                   404                   13
## 2                    13                    3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product) 
head(matrix_product)
##         LD1         LD2        LD3
## 1 -2.434675 -0.74506050  1.1272297
## 2 -2.074241 -0.48694249  0.4409222
## 3 -1.213673  0.64642383  0.1111085
## 4  5.733068 -0.04615067 -0.2150409
## 5 -2.056799  0.82384756  0.9980802
## 6 -2.584096 -0.12738163  0.6687811
#Hidden
p1 <- plot_ly(data = matrix_product, x = ~LD1, y = ~LD2, z = ~LD3, 
        type= 'scatter3d', mode='markers', color = train$crim, 
        marker = list(size = 3, width = 2), scene = "scene1") 

# The k-means model was fit on all observations, so subset its cluster labels to the sampled training rows
train_km <- as.data.frame(km$cluster)[sample_ind,] %>% as.factor()

p2 <- plot_ly(data = matrix_product, x = ~LD1, y = ~LD2, z = ~LD3, 
        type= 'scatter3d', mode='markers', color = train_km,
        marker = list(size = 3, width = 2), scene = "scene2")

subplot(p1,p2) %>% 
  layout(scene1 = list(domain=list(x=c(0,0.5),y=c(0,1)), 
                       xaxis = list(title = "LD1"), 
                       yaxis = list(title = "LD2"),
                       zaxis = list(title = "LD3")),
         scene2 = list(domain=list(x=c(0.5,1),y=c(0,1)),
                       xaxis = list(title = "LD1"), 
                       yaxis = list(title = "LD2"),
                       zaxis = list(title = "LD3")))

The most obvious difference is of course that the first LDA plot has four classes whereas the k-means plot has only three. The clusterings are otherwise fairly similar: both models easily distinguish the group with very positive LD1 values. Judging visually, k-means seems to achieve better separation within the other, larger cluster.
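The visual impression can be made concrete by cross-tabulating the crime-rate classes against the k-means clusters (a sketch, assuming `train$crim` and the `train_km` factor defined above):

```r
# Rows: original crime-rate classes, columns: k-means clusters.
# Counts concentrated in a few cells indicate that the partitions agree.
table(crim = train$crim, cluster = train_km)
```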


Dimensionality reduction exercise

The data

1. Load and describe data

Code for data wrangling can be found here

#Hidden
source("helper_functions.R")
library(magrittr)
library(GGally)
library(stringr)
library(knitr)
library(kableExtra)
library(ggplot2)
library(tidyr)
library(MASS)
library(dplyr)
df <- read.csv(file = "data/human.csv", row.names = 1) 
glimpse(df)
## Observations: 155
## Variables: 8
## $ eduRatio      <dbl> 1.0072389, 0.9968288, 0.9834369, 0.9886128, 0.96...
## $ labRatio      <dbl> 0.8908297, 0.8189415, 0.8251001, 0.8840361, 0.82...
## $ expEdu        <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5, ...
## $ lifeExp       <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9, 80.9, 79.1, ...
## $ GNIC          <dbl> 64.992, 42.261, 56.431, 44.025, 45.435, 43.919, ...
## $ matMort       <int> 4, 6, 6, 5, 6, 7, 9, 28, 11, 8, 6, 4, 8, 4, 27, ...
## $ adolBirthRate <dbl> 7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31.0, 14.5, ...
## $ reprParl      <dbl> 39.6, 30.5, 28.5, 38.0, 36.9, 36.9, 19.9, 19.4, ...

The dataset contains human development characteristics, totalling 155 observations (countries) and 8 variables.

2. Graphical overview of data

#Hidden
df %>% summaryKable() %>% 
  kable("html", align = "rrr", caption = "Data variable summary") %>%
  kable_styling(bootstrap_options = c("hover", "condensed")) %>% 
  scroll_box(width = "100%")
Data variable summary
eduRatio labRatio expEdu lifeExp GNIC matMort adolBirthRate reprParl
Min 0.172 0.186 5.400 49.000 1.123 1.000 0.600 0.000
1st Q 0.726 0.598 11.250 66.300 5.275 11.500 12.650 12.400
Median 0.938 0.753 13.500 74.200 13.054 49.000 33.600 19.300
Mean 0.853 0.707 13.176 71.654 46.496 149.084 47.159 20.912
3rd Q 0.997 0.854 15.200 77.250 27.256 190.000 71.950 27.950
Max 1.497 1.038 20.200 83.500 908.000 1100.000 204.800 57.500
# Hidden
ggpairs(df,
  title = "Study variable overview",
  upper = list(continuous = wrap("cor", size = 3)),
  lower = list(
    continuous = wrap("points", alpha = .2, size = .4),
    combo = wrap("facethist", bins = 20))) +
theme(axis.text.x = element_text(
                  angle = 90,
                  color = "black",
                  size = 6,
                  vjust = .5),
      axis.text.y = element_text(color = "black", size = 6))

Many of the variables are roughly normally distributed, but eduRatio, GNIC and matMort are clearly skewed, which may cause problems in analyses that assume normality.
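If normality matters for a later analysis, one common remedy is a log transform of the right-skewed variables; a minimal sketch (not applied in the analysis below):

```r
# Sketch: log-transform the clearly right-skewed variables and
# re-inspect their distributions.
df %>% 
  mutate(GNIC = log(GNIC), matMort = log(matMort)) %>% 
  gather(variable, value) %>% 
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ variable, scales = "free")
```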

#Hidden
cor(df, use = "pairwise.complete.obs") %>% 
  corrplot::corrplot(method = "number", type = "upper", 
                     tl.pos = "td", tl.srt = 25, 
                     tl.offset = 1, tl.cex = .7,
                     number.cex = .75, 
                     title = "Correlation between study variables",
                     mar=c(0,0,1,0),
                     diag = F) 

According to the above plot, expected education (expEdu) correlates strongly and positively with life expectancy, and negatively with maternal mortality and the adolescent birth rate. Maternal mortality in turn correlates strongly and negatively with the education ratio, life expectancy and expected education.

Principal component analysis

3. PCA on non-normalized data

pca <- prcomp(df)
pca %>% summary()
## Importance of components:
##                            PC1      PC2      PC3      PC4     PC5     PC6
## Standard deviation     227.566 121.2798 26.48071 11.47526 3.89773 1.60859
## Proportion of Variance   0.769   0.2184  0.01041  0.00196 0.00023 0.00004
## Cumulative Proportion    0.769   0.9874  0.99778  0.99974 0.99996 1.00000
##                           PC7    PC8
## Standard deviation     0.1915 0.1598
## Proportion of Variance 0.0000 0.0000
## Cumulative Proportion  1.0000 1.0000
#Hidden
biplot(pca, choices = 1:2)

4. PCA on normalized data

df_scaled <- scale(df)
pca <- prcomp(df_scaled)
pca %>% summary
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     1.9970 1.1465 0.9439 0.85645 0.69002 0.53246
## Proportion of Variance 0.4985 0.1643 0.1114 0.09169 0.05952 0.03544
## Cumulative Proportion  0.4985 0.6628 0.7742 0.86585 0.92536 0.96080
##                            PC7     PC8
## Standard deviation     0.46511 0.31184
## Proportion of Variance 0.02704 0.01216
## Cumulative Proportion  0.98784 1.00000
#Hidden
biplot(pca, choices = 1:2)

5. Interpret differences

Without scaling, the PCA is dominated by the variables with the largest absolute variances, matMort and GNIC: PC1 and PC2 together capture 98.7 % of the variance and essentially just reproduce those two variables. After standardization every variable contributes on an equal footing, the variance spreads across several components (PC1 explains about 50 %), and the biplot becomes interpretable.
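The dominance in the unscaled solution can be checked directly from the column standard deviations (a quick sketch on the human data loaded above):

```r
# Standard deviations of the raw variables: matMort and GNIC dwarf the
# rest, which is why they dominate the unscaled PCA.
apply(df, 2, sd, na.rm = TRUE) %>% sort(decreasing = TRUE) %>% round(2)
```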

6. MCA with tea dataset

library(FactoMineR)
data(tea)
df <- tea

mca <- df %>% select(breakfast, friends, resto, Tea, Sport) %>%
  MCA(graph = FALSE)
mca %>% summary
## 
## Call:
## MCA(X = ., graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.268   0.235   0.213   0.174   0.160   0.149
## % of var.             22.370  19.550  17.778  14.522  13.358  12.421
## Cumulative % of var.  22.370  41.920  59.698  74.221  87.579 100.000
## 
## Individuals (the 10 first)
##                  Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1             |  0.529  0.348  0.198 | -0.982  1.371  0.684 |  0.302
## 2             |  0.529  0.348  0.198 | -0.982  1.371  0.684 |  0.302
## 3             | -0.753  0.704  0.517 |  0.256  0.093  0.060 | -0.242
## 4             |  0.367  0.167  0.130 |  0.372  0.197  0.133 |  0.220
## 5             |  0.361  0.162  0.143 | -0.618  0.543  0.419 | -0.446
## 6             |  0.384  0.183  0.167 | -0.007  0.000  0.000 | -0.351
## 7             | -0.150  0.028  0.035 | -0.303  0.131  0.144 | -0.600
## 8             |  0.552  0.378  0.221 | -0.372  0.196  0.100 |  0.397
## 9             |  0.361  0.162  0.143 | -0.618  0.543  0.419 | -0.446
## 10            |  0.513  0.326  0.167 | -0.603  0.517  0.231 |  0.873
##                  ctr   cos2  
## 1              0.143  0.065 |
## 2              0.143  0.065 |
## 3              0.091  0.053 |
## 4              0.075  0.046 |
## 5              0.310  0.218 |
## 6              0.193  0.140 |
## 7              0.563  0.563 |
## 8              0.246  0.114 |
## 9              0.310  0.218 |
## 10             1.190  0.485 |
## 
## Categories (the 10 first)
##                   Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## breakfast     |  -0.031   0.034   0.001  -0.515 |  -0.769  24.200   0.546
## Not.breakfast |   0.029   0.032   0.001   0.515 |   0.710  22.339   0.546
## friends       |  -0.459  10.250   0.397 -10.893 |   0.264   3.885   0.131
## Not.friends   |   0.865  19.318   0.397  10.893 |  -0.498   7.322   0.131
## Not.resto     |   0.427   9.993   0.509  12.341 |   0.032   0.066   0.003
## resto         |  -1.194  27.955   0.509 -12.341 |  -0.091   0.185   0.003
## black         |   0.093   0.158   0.003   0.918 |  -0.774  12.612   0.196
## Earl Grey     |  -0.343   5.654   0.213  -7.976 |   0.108   0.641   0.021
## green         |   1.801  26.573   0.401  10.946 |   1.104  11.439   0.151
## Not.sportsman |  -0.025   0.019   0.000  -0.362 |   0.548  10.329   0.203
##                v.test     Dim.3     ctr    cos2  v.test  
## breakfast     -12.776 |  -0.114   0.580   0.012  -1.886 |
## Not.breakfast  12.776 |   0.105   0.536   0.012   1.886 |
## friends         6.270 |  -0.124   0.938   0.029  -2.937 |
## Not.friends    -6.270 |   0.233   1.767   0.029   2.937 |
## Not.resto       0.938 |  -0.161   1.782   0.072  -4.646 |
## resto          -0.938 |   0.449   4.985   0.072   4.646 |
## black          -7.663 |   1.270  37.305   0.528  12.567 |
## Earl Grey       2.510 |  -0.457  12.585   0.376 -10.608 |
## green           6.714 |  -0.177   0.322   0.004  -1.073 |
## Not.sportsman   7.792 |   0.787  23.390   0.418  11.182 |
## 
## Categorical variables (eta2)
##                 Dim.1 Dim.2 Dim.3  
## breakfast     | 0.001 0.546 0.012 |
## friends       | 0.397 0.131 0.029 |
## resto         | 0.509 0.003 0.072 |
## Tea           | 0.435 0.290 0.536 |
## Sport         | 0.000 0.203 0.418 |
#Hidden
mca %>% plot(invisible=c("ind"))
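The variable-level summary above (eta²) can also be extracted programmatically from the result object; a sketch, assuming the `mca` fit from above:

```r
# Squared correlation ratio of each variable with the first two MCA dimensions.
round(mca$var$eta2[, 1:2], 3)
```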